Data Description

Background

AllLife Bank has a growing customer base. Majority of these customers are liability customers (depositors) with varying size of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Context

This case study is about a bank whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

Objectives

The classification goal is to predict the likelihood of a liability customer buying a personal loan. In other words, we have to build a model that predicts which customers are most likely to accept a personal loan offer, based on their relationship with the bank across the features given in the dataset.

This means:

  1. Predict whether a liability customer will buy a personal loan or not.
  2. Identify which variables are most significant.
  3. Determine which segment of customers should be targeted more.

Data Dictionary

Problem Statement:

The department wants to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.

Importing the necessary libraries

I have used numpy, pandas, matplotlib, seaborn, and scipy for EDA and data visualization, and sklearn for data splitting, model building, and the confusion matrix.
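The imports above can be sketched as follows. This is a minimal, self-contained version: matplotlib, seaborn, and scipy (used for plotting and statistics in the notebook) are noted in comments so the sketch stays runnable on its own.

```python
# Core data-handling libraries
import numpy as np
import pandas as pd

# scikit-learn pieces for splitting, model building, and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# The notebook also uses matplotlib.pyplot, seaborn, and scipy.stats
# for visualization; they are omitted here to keep the sketch minimal.
```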

Importing the Data Frame

Exploratory Data Analysis

View the first and last 5 rows of the dataset
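Viewing the first and last rows can be sketched as below. A small synthetic stand-in dataframe is used here, since the actual CSV file is not shown in this section; the column names are illustrative.

```python
import pandas as pd

# Small synthetic stand-in for the bank dataset (illustrative only)
df = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [25, 45, 39, 35, 35, 37, 53, 50, 35, 34],
    "Income": [49, 34, 11, 100, 45, 29, 72, 22, 81, 180],
})

print(df.head())  # first 5 rows
print(df.tail())  # last 5 rows
```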

From the above data frame I can deduce the following:

Nominal Variables:

Ordinal Categorical Variables:

Interval Variables:

Shape of the dataset

We have 13 independent variables and 1 dependent variable, 'Personal Loan', in the dataset. We also have 5000 rows, which can be split into train and test datasets.

Data type of each attribute

Here we can see that all the variables are numerical. However, the columns 'CD Account', 'Online', 'Family', 'Education', 'CreditCard', and 'Securities Account' are categorical variables and should be converted to the 'category' type.

Converting the data types
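Assuming the dataframe is named `df`, the conversion can be sketched as below (a small synthetic dataframe with the columns named in the text stands in for the real data):

```python
import pandas as pd

# Synthetic stand-in with the categorical columns named above
df = pd.DataFrame({
    "CD Account": [0, 1, 0],
    "Online": [1, 1, 0],
    "Family": [2, 3, 1],
    "Education": [1, 2, 3],
    "CreditCard": [0, 0, 1],
    "Securities Account": [1, 0, 0],
})

categorical_cols = ["CD Account", "Online", "Family",
                    "Education", "CreditCard", "Securities Account"]
df[categorical_cols] = df[categorical_cols].astype("category")

print(df.dtypes)
```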

Summary of the dataset

Checking for missing values

The total missing value and null count for each column is 0, so there are no missing or null values in the dataframe.
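The check above is typically done with `isnull().sum()`; a sketch on a synthetic dataframe:

```python
import pandas as pd

# Synthetic complete dataframe (no missing values)
df = pd.DataFrame({"Age": [25, 45, 39], "Income": [49, 34, 11]})

# Per-column null counts and the grand total
null_counts = df.isnull().sum()
total_missing = null_counts.sum()
print(null_counts)
print("Total missing:", total_missing)
```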

Checking for unique data

Transposing Index and Columns

Observation:

Note!!

Trying to Fix the Negative Values in "Experience" column (Data Cleaning)

This means we have 52 negative values in total in the "Experience" column.

So, the 52 negative values in Experience are replaced by NaN. Now we will fill them with the median value.
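The replace-then-fill step can be sketched like this (a short synthetic Experience series stands in for the real column; the notebook's exact code is not shown):

```python
import pandas as pd

# Synthetic Experience column containing negative entries
experience = pd.Series([5, -1, 10, -2, 3, 8])

# Step 1: replace negative values with NaN
experience = experience.mask(experience < 0)

# Step 2: fill the NaNs with the median of the remaining values
experience = experience.fillna(experience.median())
print(experience.tolist())
```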

Now we generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Pairplot which includes all the columns in the dataframe

The 'ID' column is irrelevant in the pairplot because it is just a record index.

Observations

From the above pair plot we can infer the association among the attributes and target column as follows:

Checking the association of Experience with other quantitative variables

From the above graphs it can be deduced that "Age" has a very strong positive association with the "Experience" column. We can also consider 'Education' when fixing the negative experience values, because experience relates to education level.

Relationship between the attributes:

Experience and Age have a linear relationship, so one of them can be dropped without affecting the accuracy.

Converting the negative values in "Experience" column to 0

The steps I would follow, and the code I would use, to convert the negative values to 0:

This shows that the negative value count has now become 0, which means there are no negative values left in the dataframe.
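One compact way to sketch this conversion (an assumption; the notebook's exact code is not shown) is pandas' `clip`, which caps values at a lower bound:

```python
import pandas as pd

# Synthetic Experience column with the -3 minimum mentioned in the text
experience = pd.Series([-3, 0, 5, -1, 12])

# Replace every negative value with 0
experience = experience.clip(lower=0)
print(experience.min())
```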

Also, we describe the 'Experience' column to check the count, mean, standard deviation, and the five-point summary.

Observe: the min is now 0.0, whereas it was -3.0 before the fix.

Univariate Analysis

ID

The above graph shows that 'ID' is uniformly distributed.

Age

Income

Observation: The above distribution is right-skewed because the tail extends to the right.

Zip Code

CCAvg

Observation:

The distribution is right-skewed because the tail is on the right. This means most customers' average monthly credit card spending is between 1k and 2.5k. However, a few customers spend more than 8k per month on their credit cards.

Education

Observation

Mortgage

Online

Credit Card

Personal Loan

Since the dataset is about customers taking, and being eligible for, personal loans, this column will be the target column for the distribution and will be associated with the other columns to provide more insights.

This reveals that out of 5000 data points:

This means that the percentage of customers who did not take the loan is significantly greater than that of customers who did.

Observation

The pie chart shows that the data is heavily imbalanced (almost 1:10) toward the customers who did not accept the personal loan. Therefore we should build a model that performs well at predicting whether customers will accept a personal loan. Hence, our goal is to identify the customers likely to accept a personal loan based on the given features.
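The class balance reported above (480 acceptances out of 5000, i.e. 9.6%) can be checked with `value_counts`; a sketch on a synthetic target with the same balance:

```python
import pandas as pd

# Synthetic target with the class balance reported in the text (480 of 5000)
personal_loan = pd.Series([1] * 480 + [0] * 4520, name="Personal Loan")

counts = personal_loan.value_counts()
shares = personal_loan.value_counts(normalize=True)
print(counts)
print(shares)
```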

General Observations from the univariate analysis

Multivariate Analysis

We are going to see how the other features influence Personal Loan.

Also, here are some hypotheses that were generated earlier:

These hypotheses will be tested in the multivariate analysis with respect to the target variable.

Family vs Personal Loan

Observations

Education

Observations:

CD_Account

Observations:

Credit Card

Observation

Securities_Account

Observation:

Online

Numerical Variables Vs Target Variable

The numerical variables ('Age', 'CC_Avg', 'Income', 'Mortgage', 'Experience') vs the target variable ('Personal_Loan'). We will compare the mean of each numeric independent variable for customers who buy the personal loan against the mean for customers who do not.
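This comparison of group means is a `groupby` over the target; a sketch on a tiny illustrative dataframe (the values are made up, only the shape of the computation matters):

```python
import pandas as pd

# Tiny illustrative dataframe; real values come from the bank dataset
df = pd.DataFrame({
    "Personal_Loan": [0, 0, 0, 1, 1],
    "Income": [40, 50, 60, 120, 150],
    "CC_Avg": [1.0, 1.5, 2.0, 4.0, 5.0],
})

# Mean of each numeric variable, split by loan acceptance
group_means = df.groupby("Personal_Loan")[["Income", "CC_Avg"]].mean()
print(group_means)
```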

Age vs Personal Loan

Observation:

CCAvg vs Personal Loan

Now we are going to check how much personal loan buyers spend on their credit cards monthly, on average. We will again use the groupby function.

Again, the y-axis represents the mean of customers' monthly credit card spending. This gives the same inference as above: customers with high monthly credit card spending are more likely to take a loan.

This shows that customers who have taken a personal loan have a higher credit card average than those who have not. So a high credit card average seems to be a good predictor of whether or not a customer will take a personal loan.

Income

Now we will check how a customer's income affects the likelihood of them taking a loan.

Outlier Detection and Treatment

Correlation using Heatmap

Observation from the Heatmap:

After removing the outliers, the dataset dropped 100+ rows that contained them, and now the data is ready for model building!
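The notebook does not show the removal code; a common approach (an assumption, not necessarily the one used here) is the 1.5×IQR rule, sketched on a synthetic column with one obvious outlier:

```python
import pandas as pd

# Synthetic numeric column with one clear outlier (100)
income = pd.Series([10, 12, 11, 13, 12, 14, 100])

# Standard IQR rule: keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
income_clean = income[(income >= lower) & (income <= upper)]

print(len(income), "->", len(income_clean))
```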

Target Distribution

Before we begin model building, let us recall the target variable that will be used to create insights.

Observation

Dropping columns

The above shows that 'ID' and 'ZIP Code' are irrelevant for our model building, so we will drop them.

Age and Experience are also highly correlated, so we can build our model with either one.
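Dropping the two irrelevant columns is a single `drop` call; a sketch on a synthetic dataframe:

```python
import pandas as pd

# Synthetic stand-in containing the columns to be dropped
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "ZIP Code": [91107, 90089, 94720],
    "Income": [49, 34, 11],
})

df = df.drop(columns=["ID", "ZIP Code"])
print(df.columns.tolist())
```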

Model Building

From the multivariate analysis, the "Age" column does not give interesting insights, and comparing age bins would require a lot of analysis without aiding quick results. Age may correlate reasonably with the target variable, but it does not correlate well with the other variables in the dataset.

I have chosen to use Experience because it is easier for me to work with; it is categorised into TWO groups, namely:

Creating two new dataframes for model building: 'With Experience' and 'Without Experience' respectively

Separating the Target Variable from the Independent Variables in the Two New Dataframes

Splitting the data into training and test sets in the ratio of 70:30 respectively
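The 70:30 split uses sklearn's `train_test_split`; a sketch on a small synthetic X and y (stratifying on the target is a common choice for imbalanced data like this, though the notebook does not confirm it):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target
X = pd.DataFrame({"Income": range(10), "CC_Avg": range(10)})
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# 70:30 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(len(X_train), len(X_test))
```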

Logistic Regression

Using the Experience Column

Without Using the Experience Column

Below is a comparison between the Logistic Regression model accuracy and confusion matrix with 'Experience' and without 'Experience'.
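The fit-predict-evaluate loop behind that comparison can be sketched as below, on synthetic separable data (the real model is trained on the bank features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic, well-separated two-class toy data
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)

print("Accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))
```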

Observations

Improvement of the model

Iteration 2 For Logistic Regression with Experience

Observation:

Of the entire test set, 74% of the customers predicted to accept the personal loan did accept it.

Logistic Regression Results

Decision Trees

Build Decision Tree Model

We will build our model using the DecisionTreeClassifier function, using both the default 'gini' criterion and the 'entropy' criterion to split.
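Fitting both criteria can be sketched as below on synthetic data; an unconstrained tree will fit its training data perfectly, which is the overfitting that the later pruning step addresses:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and a deterministic target
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

tree_gini = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)

# Unconstrained trees reach 100% training accuracy
print(tree_gini.score(X, y), tree_entropy.score(X, y))
```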

Scoring our Decision Tree

Visualisation of the Decision Tree

Observation

Pruning

Cost Complexity Pruning

We are going to train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
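The procedure described above follows scikit-learn's cost-complexity pruning API; a runnable sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the real model uses the bank features)
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

# Compute the effective alphas along the pruning path
base = DecisionTreeClassifier(random_state=0)
path = base.cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Train one tree per effective alpha
clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]

# The last alpha prunes the whole tree, leaving a single node
last_node_count = clfs[-1].tree_.node_count

# Drop the trivial single-node tree before comparing the rest
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
print(last_node_count, len(clfs))
```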

Recall vs alpha for training and testing sets

Confusion Matrix

Visualizing the Decision Tree

Income, Education, Family, and CCAvg are still the top important features. However, there has been a shift in relative importance: Income was initially the most important feature, but after the pruning exercise, Education has more relative importance.

Therefore, monthly income and education are the most significant factors in deciding a personal loan.
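Feature importances like those discussed above come from a fitted tree's `feature_importances_` attribute; a sketch on synthetic columns named after the features in the text (the data and the resulting ranking are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic features named after the real ones; values are made up
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "Income": rng.rand(200),
    "Education": rng.randint(1, 4, 200),
    "Family": rng.randint(1, 5, 200),
    "CCAvg": rng.rand(200),
})
y = ((X["Income"] > 0.6) & (X["Education"] > 1)).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```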

Comparing all the decision tree models

Final Observation From the Model Building

Conclusion

The aim of AllLife Bank is to convert its liability customers into loan customers. They want to set up a new marketing campaign; hence, they need information about the connections between the variables given in the data. Two classification algorithms were used in this study. From the comparison above, the Decision Tree algorithm has the highest accuracy, so we choose it as our final model.